Optimizing Pandas DataFrames

Filter first!

data science
data engineering
Published: December 28, 2022


Created by: Claudia López

When chaining multiple operations, it is worthwhile to think about which operations to execute first in order to optimize the statement. Filter steps should be executed as early as possible.

It is always recommended to filter the data where the data lives, for example in BigQuery. If that is not possible, filter your DataFrame as soon as possible so that you only work with the data you need, making every later operation cheaper.
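As an illustrative sketch of the filter-early principle (the tables and column names here are hypothetical, not from the Titanic data), compare filtering after a merge with filtering before it:

```python
import pandas as pd

# Hypothetical orders/customers tables; names are illustrative only.
orders = pd.DataFrame({"customer_id": range(1000), "amount": range(1000)})
customers = pd.DataFrame({"customer_id": range(1000), "country": ["PE"] * 1000})

# Slower pattern: merge everything, then filter the combined result
late = orders.merge(customers, on="customer_id")
late = late[late["amount"] > 990]

# Faster pattern: filter first, so the merge only touches the rows we need
early = orders[orders["amount"] > 990].merge(customers, on="customer_id")

# Both produce the same rows; only the amount of work differs
assert late.reset_index(drop=True).equals(early.reset_index(drop=True))
```

The second pattern merges 9 rows instead of 1,000; on real-world table sizes the difference is dramatic.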

⚙️ Precondition: Getting Data

import pandas as pd
import numpy as np
import time

titanic_data = pd.read_csv('test.csv')
passenger_id = 895
titanic_data
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
413 1305 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S
414 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
415 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
416 1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C

418 rows × 11 columns

🔧 Filtering Methods

Pandas provides many methods for data selection; some of them are reviewed below.

1. Index []

Indexing is the simplest way of filtering data: the condition expression creates a Boolean Series, which we can then use to filter the DataFrame.

start = time.time()

#Filter using []
passengers_by_ticket_index = titanic_data[(titanic_data['PassengerId'] == passenger_id)]

# Duration time
print(f"time : {(time.time() - start) * 1000} ms")
time : 1.646280288696289 ms

2. query()

This is a much cleaner and easier way to filter rows. query() also supports more complicated conditional expressions and can be faster than [] on large DataFrames.

start = time.time()

#Filter using query
passengers_by_ticket_query = titanic_data.query('PassengerId == @passenger_id')

# Duration time
print(f"time : {(time.time() - start) * 1000} ms")
time : 6.360054016113281 ms

3. eval()

The eval() function in Pandas uses string expressions to efficiently compute operations on a DataFrame.

start = time.time()

#Filter using eval
passengers_by_ticket_eval = titanic_data[titanic_data.eval('PassengerId == @passenger_id')]

# Duration time
print(f"time : {(time.time() - start) * 1000} ms")
time : 4.781246185302734 ms

⭐ Evaluation

✅ The traditional method [] is faster for smaller arrays.

✅ The benefit of eval and query is mainly in the saved memory, and the sometimes cleaner syntax they offer.

✅ The advantages of eval and query lie in very large datasets.

✅ It is recommended to use eval or query when you work with a lot of data.

👀 Choosing between eval(), query(), and [] does not by itself guarantee a performance improvement: the outcome depends on several variables, such as data size, dtypes, and hardware.
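To see these trade-offs at scale, here is a small benchmark on a synthetic DataFrame (the column names mirror the example above; absolute timings will vary by machine and pandas version):

```python
import time

import numpy as np
import pandas as pd

# Synthetic DataFrame, much larger than the Titanic sample
n = 1_000_000
df = pd.DataFrame({"PassengerId": np.arange(n), "Fare": np.random.rand(n)})
target = 500_000

start = time.time()
by_index = df[df["PassengerId"] == target]
print(f"[]    : {(time.time() - start) * 1000:.3f} ms")

start = time.time()
by_query = df.query("PassengerId == @target")
print(f"query : {(time.time() - start) * 1000:.3f} ms")

start = time.time()
by_eval = df[df.eval("PassengerId == @target")]
print(f"eval  : {(time.time() - start) * 1000:.3f} ms")

# All three return the same single row
assert by_index.equals(by_query) and by_index.equals(by_eval)
```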

🔍 Recommendations for Faster Lookup

Here are some strategies and properties that are useful for lookups.

# Variables
position = 3
column = "Ticket"

Index Optimization

Pandas has optimized operations based on indices, so it is recommended to use an index in DataFrames to allow faster lookups.

Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.

# 1. We identify PassengerId as a candidate variable to use as the index
titanic_data_indexed = titanic_data.set_index("PassengerId", drop=False, inplace=False)

start = time.time()

# 2. Filter the DataFrame using the PassengerId column
passengers_by_ticket_eval = titanic_data_indexed[titanic_data_indexed.eval('PassengerId == @passenger_id')]

#3. Duration time
print(f"time : {(time.time() - start) * 1000} ms")
time : 4.584312438964844 ms
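Note that the cell above still filters on the PassengerId column, so the index is not actually exploited. The real payoff of an index is looking up rows by label with .loc, which uses a hash-based lookup instead of scanning the column. A minimal sketch on a small stand-in frame:

```python
import time

import pandas as pd

# Small illustrative frame standing in for the Titanic data
df = pd.DataFrame({"PassengerId": [892, 893, 894, 895],
                   "Name": ["Kelly", "Wilkes", "Myles", "Wirz"]})

indexed = df.set_index("PassengerId", drop=False)

start = time.time()
row = indexed.loc[895]          # direct lookup on the index label
print(f"time : {(time.time() - start) * 1000} ms")
print(row["Name"])
```

On a unique index, .loc[label] avoids building a Boolean mask over the whole column.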

Lookup a Single Value

When we need to retrieve a single value from a DataFrame, it is recommended to use .at[] because it is faster than .loc[].

.loc

The .loc property of the DataFrame object allows the return of specified rows and/or columns from that DataFrame.

df.loc[rows,columns]

Note: .loc is not a method, it is a property indexed via square brackets.

start = time.time()
passenger = titanic_data.loc[position, column]
print(f"time : {(time.time() - start) * 1000} ms")
time : 0.2779960632324219 ms

.at

Access a single value for a row/column label pair.

This method works in a similar way to .loc[], but .at[] returns only a single value, which is why it is faster.

start = time.time()
passenger = titanic_data.at[position, column]
print(f"time : {(time.time() - start) * 1000} ms")
time : 0.2944469451904297 ms

Vectorize Operations

Vectorization is the process of executing operations on entire arrays at once, similarly to NumPy.

It is recommended to avoid for loops when working with DataFrames, because per-row read and write operations are expensive. When looping seems unavoidable, prefer native NumPy operations, or .map() for simple element-wise transformations.
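A small synthetic sketch of the difference (the column name is illustrative):

```python
import time

import numpy as np
import pandas as pd

fares = pd.DataFrame({"Fare": np.random.rand(100_000) * 100})

# Loop version: one Python-level operation per row
start = time.time()
taxed_loop = [fare * 1.18 for fare in fares["Fare"]]
print(f"loop      : {(time.time() - start) * 1000:.2f} ms")

# Vectorized version: a single operation on the whole column
start = time.time()
taxed_vec = fares["Fare"] * 1.18
print(f"vectorized: {(time.time() - start) * 1000:.2f} ms")

# .map() is a readable fallback for simple element-wise transforms
taxed_map = fares["Fare"].map(lambda f: f * 1.18)

assert np.allclose(taxed_loop, taxed_vec)
```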

Verify Memory Usage

Whenever you work with a DataFrame, verify its memory usage. You can do this with the info() or memory_usage() methods.

.info()

Show a concise summary of a DataFrame.

titanic_data.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

.memory_usage()

This function returns the memory usage of each column in bytes.

titanic_data.memory_usage(index=False, deep=True)
PassengerId     3344
Pclass          3344
Name           35314
Sex            25802
Age             3344
SibSp           3344
Parch           3344
Ticket         26700
Fare            3344
Cabin          16022
Embarked       24244
dtype: int64
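The listing shows that object (string) columns such as Name, Sex, and Embarked dominate the memory. A related technique, not covered in the downcasting steps below: low-cardinality string columns can be converted to the category dtype, which often shrinks them considerably. A hedged sketch on synthetic data of the same shape:

```python
import pandas as pd

# Illustrative 418-row frame: Sex and Embarked have very few distinct values
df = pd.DataFrame({"Sex": ["male", "female"] * 209,
                   "Embarked": ["S", "C", "Q", "S"] * 104 + ["S", "C"]})

before = df.memory_usage(index=False, deep=True)
df["Sex"] = df["Sex"].astype("category")
df["Embarked"] = df["Embarked"].astype("category")
after = df.memory_usage(index=False, deep=True)

print(before)
print(after)
```

A category column stores small integer codes plus one copy of each distinct value, so the saving grows with the number of rows.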

Memory Optimization

When a dataset is read with a Pandas read function such as read_csv or read_excel, Pandas decides the data types and loads the data into RAM. By default, integer values are assigned int64, float values float64, and string values object. The problem is that an int64 takes up eight times more memory than an int8.

The idea is to downcast the data type: review the max and min values of a column and choose the correct data type for that specific column.

Visit Data types in Python
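If you prefer to check the ranges programmatically rather than looking them up, NumPy exposes them via np.iinfo (and np.finfo for floats):

```python
import numpy as np

# Query integer dtype ranges instead of memorizing them
for dtype in ["int8", "uint8", "int16", "uint16", "int64"]:
    info = np.iinfo(dtype)
    print(f"{dtype:7s}: {info.min} to {info.max}")
```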

Optimizing Memory step by step

# 1. Get info about the dataframe:
titanic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
# 2. Show memory usage for the PassengerId column: 3344 bytes
titanic_data.memory_usage(index=False, deep=True)
PassengerId     3344
Pclass          3344
Name           35314
Sex            25802
Age             3344
SibSp           3344
Parch           3344
Ticket         26700
Fare            3344
Cabin          16022
Embarked       24244
dtype: int64
#3. Get the Max and Min value for PassengerId column and decide the best datatype:
min_value = titanic_data["PassengerId"].min()
max_value = titanic_data["PassengerId"].max()
print(f"Min Value: {min_value} - Max Value: {max_value}")
Min Value: 892 - Max Value: 1309
Analysis
  • ▶ The PassengerId column has values between 892 and 1309.

  • ▶ This range does not contain negative numbers.

  • ▶ Pandas assigned int64 to this column, but the range of int64 is Integer (-9223372036854775808 to 9223372036854775807):

    • 🔴 It’s a wide range.

    • 🔴 Allows negative numbers.

    • 🟢 We can review the ranges of the different data types and choose the best one.

    • 🟢 So, we finally decide to use uint16, but why is it the best option?

      • ✔ The range of uint16 is 0 to 65535 (unsigned integer).

      • ✔ The range is enough to contain the values for the PassengerId Column.

      • ✔ We only need positive numbers.

# 4. Set the PassengerId column to the uint16 data type:
titanic_data["PassengerId"] = titanic_data["PassengerId"].astype("uint16")
# 5. Show memory usage for the PassengerId column again:
titanic_data.memory_usage(index=False, deep=True)
# PassengerId has been reduced from 3344 bytes to 836 bytes
PassengerId      836
Pclass          3344
Name           35314
Sex            25802
Age             3344
SibSp           3344
Parch           3344
Ticket         26700
Fare            3344
Cabin          16022
Embarked       24244
dtype: int64
# 6. Get info again:
titanic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    uint16 
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(3), object(5), uint16(1)
memory usage: 33.6+ KB

The memory usage has been reduced from 36.0+ KB to 33.6+ KB.
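The manual min/max inspection above can also be automated: pd.to_numeric with the downcast argument picks the smallest dtype that fits the values. A sketch on the same PassengerId range:

```python
import numpy as np
import pandas as pd

# Same value range as the Titanic PassengerId column
df = pd.DataFrame({"PassengerId": np.arange(892, 1310, dtype="int64")})

# downcast="unsigned" selects the smallest unsigned integer type that fits
df["PassengerId"] = pd.to_numeric(df["PassengerId"], downcast="unsigned")
print(df["PassengerId"].dtype)   # uint16
```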

🔖 Summary

✅ Filter steps should be executed as early as possible.

✅ To look up a single value, .at is a good choice.

✅ The method [] is faster for smaller arrays.

✅ Verify the memory usage with the memory_usage method.

✅ Downcast the data type by reviewing the max and min values of a column and choosing the correct data type for that specific column.

✅ Use an index in DataFrames to allow faster lookups.

❌ Try to avoid for loops, but if you can’t avoid them, use .map().